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Abstract. Assembling stiffness matrices represents a significant cost in many finite element 
computations. We address the question of optimizing the evaluation of these matrices. By finding 
redundant computations, we are able to significantly reduce the cost of building local stiffness ma- 
trices for the Laplace operator and for the trilinear form for Navier-Stokes. For the Laplace operator 
in two space dimensions, we have developed a heuristic graph algorithm that searches for such re- 
dundancies and generates code for computing the local stiffness matrices. Up to cubics, we are able 
to build the stiffness matrix on any triangle in less than one multiply-add pair per entry. Up to 
sixth degree, we can do it in less than about two. Preliminary low-degree results for Poisson and 
Navier-Stokes operators in three dimensions are also promising. 
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1. Introduction. It has often been observed that the formation of the matrices 
arising from finite element methods over unstructured meshes takes a substantial 
amount of time and is one of the primary disadvantages of finite elements over finite 
differences. Here, we will show that the standard algorithm for computing finite 
element matrices by integration formulae is far from optimal and present a technique 
that can generate algorithms with considerably fewer operations even than well-known 
precomputation techniques. 

From fairly simple examples with Lagrangian finite elements, we will present 
a novel optimization problem and present heuristics for the automatic solution of 
this problem. We demonstrate that the stiffness matrix for the Laplace operator 
can be computed in about one multiply-add pair per entry in two-dimensions for up 
to cubics, and about two multiply-add pairs up to degree 6. Low order examples in 
three dimensions suggest similar possibilities, which we intend to explore in the future. 
More importantly, the techniques are not limited to linear problems - in fact, we show 
the potential for significant optimizations for the nonlinear term in the Navier-Stokes 
equations. These results seem to have lower flop counts than even the best quadrature 
rules for simplices. 

Our long-term goal is to develop a "form compiler" for finite element methods. 
Such a compiler will map high-level descriptions of the variational problem and finite 
element spaces into low-level code for building the algebraic systems. Currently, the 
Sundance project [HIGEII an< ^ *he DOLFIN project [TO] are developing run-time C++ 
systems for the assembly variational forms. Recent work in the PETSc project [H] 
is leading to compilation of variational forms into C code for building local matrices. 
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Our work here complements these ideas by suggesting compiler optimizations for such 
codes that would greatly enhance the run-time performance of the matrix assembly. 

Automating tedious but essential tasks has proven remarkably successful in many 
areas of scientific computing. Automatic differentiation of numerical code has allowed 
complex algorithms to be used reliably [5]. Indeed the family of automatic differen- 
tiation tools [1] that automatically produce efficient gradient, adjoint, and Hessian 
algorithms for existing code have been invaluable in enabling optimal control calcu- 
lations and Newton-based nonlinear solvers. 

2. Matrix Evaluation by Assembly. Finite element matrices are assembled 
by summing the constituent parts over each element in the mesh. This is facilitated 
through the use of a numbering scheme called the local-to-global index. This index, 
t(e, A), relates the local (or element) node number, A G C, on a particular element, in- 
dexed by e, to its position in the global node ordering [3] . This local-to-global indexing 
works for Lagrangian finite elements, but requires generalizations for other families 
of elements. While this generalization is required for assembly, our techniques for 
optimizing the computation over each element can still be applied in those situations. 

We may write a finite element function / in the form 

E E /ws (2- 1 ) 

e xec 

where fi denotes the "nodal value" of the finite element function at the i-th node in 
the global numbering scheme and {4> e x : A £ £} denotes the set of basis functions on 
the element domain T e . By definition, the element basis functions, <f>\, are extended 
by zero outside T e . In many important cases, we can relate all of the "element" basis 
functions </>| to a fixed set of basis functions on a "reference" element, T, via some 
mapping of T to T e . This mapping could involve changing both the "x" values and 
the "(f)" values in a coordinated way, as with the Piola transform [2], or it could be 
one whose Jacobian is non-constant, as with tensor-product elements or isoparametric 
elements [3]. But for a simple example, we assume that we have an affine mapping, 
£ ->■ J£ + x e , of T to T e : 

= & (J-^X-Xe)). 

The inverse mapping, x — > £ = J^ 1 (x — x e ) has as its Jacobian 

j-l _ d&n 

mj dx 3 ' 

and this is the quantity which appears in the evaluation of the bilinear forms. Of 
course, det J — 1/ det J -1 . 

The assembly algorithm utilizes the decomposition of a variational form as a sum 
over "element" forms 

a(v, w) = a e {v, w) 

e 

where the "element" bilinear form for Laplace's equation is defined (and evaluated) 
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via 



a e (v,w) := J Vv(x) ■ Vw(x) dx 



dx 3 



w( J£ + x e ) det(J) d£ 



7r. ^ 1 ^fcfe WA( °) X 

' j,m,m' = l J \Ae£ / 



/ « t (e,l) ^ 


t 


/ «><.(e,l) \ 




K e 








W(e,KI)/ 



Here, the element stiffness matrix, K e , is given by 



j,m,m' — 1 J J 



m' i . / T \ 7v- 
7^1^ det ( J )^,™,m' 



j,m,m f — 1 



m,m' — 1 



where 



and 



A' 



d 



0& 



J = l J J 



for A, fi G £ and m, m' = 1, . . . , d. 

The matrix associated with a bilinear form is 



Aij := a(fa,fa 



) = ^a e {fa,fa } ) 



(2.2) 



(2.3) 



(2.4) 



(2.5) 



(2.6) 



for all where fa denotes a global basis function. We can compute this again by 
assembly. 
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Table 3.1 

The tensor K for quadratics represented as a matrix of two by two matrices. 
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First, set all the entries of A to zero. Then loop over all elements e and local 
element numbers A and /i and compute 



A.(e,A),z.(e, A1 )+ - X! G m,m' K - 



where , are defined in ( 2.5 ) . One can imagine trying to optimize the computation 
of each 

K\,ft — G m,m'K\,n,m,m' (2-8) 

but each such term must be computed separately. We consider this optimization in 
Section [U 

3. Computing the Laplacian stiffness matrix with general elements. 

The tensor K\^ m ^ n can be presented as an \C\ x \C\ matrix of d x d matrices, as 



presented in Table 3.1 for the case of quadratics in two dimension. The entries of 
resulting matrix K e can be viewed as the dot (or Frobenius) product of the entries of 
K and G e . That is, 

Kf^ = K x ^ : G e (3.1) 

The key point is that certain dependencies among the entries of K can be used to 
significantly reduce the complexity of building each K e . For example, the four 2x2 
matrices in the upper-left corner of Table |3.1| have only one nonzero entry, and six 
others in K are zero. There are significant redundancies among the rest. For example, 
K 3j i = — 4K4,!, so once K44 : G e is computed, K 3i i : G e may be computed by only 
one additional operation. 

By taking advantage of these simplifications, we see that each K e for quadratics in 
two dimensions can be computed with at most 18 floating point operations (see Section 



3.2 ) instead of 288 floating point operations using the straightforward definition, an 
improvement of a factor of sixteen in computational complexity. 

The tensor ICx ^ m ,n f° r t ne case of linears in three dimensions is presented in 
Table [3~2| Each K e can be computed by computing the three row sums of G e , the 
three column sums, and the sum of one of these sums. We also have to negate all of 
the column and row sums, leading to a total of 20 floating point operations instead 
of 288 floating point operations using the straightforward definition, an improvement 
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Table 3.2 

The tensor K (multiplied by four) for piecewise linears in three dimensions represented as a 
matrix of three by three matrices. 
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of a factor of nearly fifteen in computational complexity. Using symmetry of G e (row 
sums equal column sums) we can reduce the computation to only 10 floating point 
operations, leading to a improvement of nearly 29. 

We leave as an exercise to work out the reductions in computation that can be 
done for the case of linears in two dimensions. 

3.1. Algorithms for determining reductions. Obtaining reductions in com- 
putation can be done systematically as follows. It may be useful to work in rational 
arithmetic and keep track of whether terms are exactly zero and determine common 
divisors. However, floating point representations may be sufficient in many cases. We 
can start by noting which sub-matrices are zero, which ones have only one non-zero 
element and so forth. Next, we find arithmetic relationships among the sub-matrices, 
as follows. 

Determining whether 

Ka,(i - cK v , w (3.2) 
requires just simple linear algebra. We think of these as vectors in d 2 -dimensional 



space and just compute the angle between the vectors. If this angle is zero, then (3.2 1 
holds. Again, if we assume that we are working in rational arithmetic then c could 
be determined as a rational number. 

Similarly, a third vector can be written in terms of two others by considering its 
projection on the first two. That is, 

Ka, p = eiK A i iJ(i i + c 2 K A 2 iAI 2 (3.3) 

if and only if the projection of K A , M onto the plane spanned by K A i p i and K A 2 ^ is 
equal to Ka, m . 

Higher-order relationships can be determined similarly by linear algebra as well. 
Note that any d 2 + 1 entries K A ai will be linearly dependent, since they are in d 2 - 
dimcnsional space. Thus we might only expect lower-order dependences to be useful 
in reducing the computational complexity. 



We can then form graphs that represent the computation of K e in (3.1 1. The 
graphs have d 2 nodes representing G e as inputs and \C\ x |£| nodes representing 
the entries of K e as outputs. Internal edges and nodes represent computations and 
temporary storage. The inputs to a given computation come directly from G e or 
indirectly from other internal nodes in the graph. 
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The computation represented in (2.7) would correspond to a dense graph in which 
each of the input nodes is connected directly to all \C\ x \C\ output nodes. It will be 
possible to attempt to reduce the complexity of the computation by finding sparse 
graphs that represent equivalent computations. 

We can generate interesting graphs by analyzing the entries of Ka, m for relation- 
ships as described above. It would appear useful to consider 

• entries K> iAl which have only one non-zero element 

• entries K> jAl which have 2 < k << d 2 non-zero elements 

• entries K^ jA1 which are scalar multiples of other elements 

• entries K^ jAl which are linear combinations of other elements 

and so forth. For each graph representing the computation of K e , we have a precise 
count of the number of floating point operations, and we can simply minimize the 
total number over all graphs generated by the above heuristics. Our examples indicate 
that computational simplications consisting of an order of magnitude or more can be 
achieved. 

We have developed a code called FErari (for Finite Element ReARrangemnts of 
Integrals) to carry out such optimizations automatically. We describe this in detail 
in Section |4.3| This code linearizes the graph representation discussed above, tak- 
ing a particular evaluation path that is based on heuristics chosen to approximate 
optimal evaluation. We have used it to show that for classes of elements, significant 
improvements in computational efficiency are available. 

3.2. Computing K for quadratics. Here we give a detailed algorithm for 
computing K for quadratics. Thus we have 6 * K e = 



f3G n 


— G\2 


7n 


7o 


4G 12 







3G22 
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4G 2 i 


7i 


7n 


722 


3(711 + 722) 


7o 
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7o 
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—73 — 8G 22 


73 


4G 21 


4G 12 





—73 — 8G22 


72 


-73 ~ 




7i 


7i 


73 


-73 - 8G n 


72 



8G11 



(3.4) 



where the G^'s are the inputs and the intermediate quantities 7* are defined and 
computed from 



70 = - 4711, 

71 = - 4 722, 

72 =4Gi22i + 8Gn22 — 73 + 8712 = 8(Gi2 + 712) 

73 =4Gi221 = 4721 = 8G12 



(3.5) 



where we use the notation G 



ijki 



Gij + Gfc£, and finally the 7y's are 



7n — Gn + G12 — G1112 712 — G11 + G22 — G1122 
721 = G12 + G21 = G1221 722 = G12 + G22 = G1222 



(3.6) 



Let us distinguish different types of operations. The above formulas involve (a) 
negation, (b) multiplication of integers and floating-point numbers, and (c) additions 
of floating-point numbers. Since the order of addition is arbitrary, we may assume 
that the operations (c) are commutative (although changing the order of evaluation 
may change the result). Thus we have G1222 = G2 2 i2 and so forth. The symmetry of 
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G implies that G1112 = G1121 and G2122 = Gi222- The symmetry of G implies that 
K e is also symmetric, by inspection, as it must be from the definition. 

The computation of the entries of K e procees as follows. The computations in 



(3.6) are done first and require only four (c) operations, or three (c) operations and 
one (b) operation (721 = 2G12). Next, the 7i's are computed via ( |3.5| , requiring four 
(b) operations and one (c) operation. Finally, the matrix K e is completed, via three 
(a) operations, seven (b) operations, and three (c) operations. This makes a total 
of three (a) operations, twelve (b) operations, and three (c) operations. Thus only 
eighteen operations are required to evaluate K e , compared with 288 operations via 
the formula (2 



It is clear that there may be other algorithms with the same amount of work 
(or less) since there are many ways to decompose some of the sub-matrices in terms 
of others. Finding (or proving) the absolute minimum may be difficult. Moreover, 
the metric for minimization should be run time, not some arbitrary way of counting 
operations. Thus the right way to utilize the ideas we are presenting may be to 
identify sets of ways to evaluate finite element matrices. These could then be tested 
on different systems (architectures plus compilers) to see which is the best. It is 
amusing that it takes fewer operations to compute K e than it does to write it down, 
so it may be that memory traffic should be considered in an optimization algorithm 
that seems the most efficient algorithm. 

4. Evaluation of general multi-linear forms. General multi-linear forms can 
appear in finite element calculations. As an example of a trilinear form, we consider 
that arising from the first order term in the Navier-Stokes equations. Though ad- 
ditional issues arise over bilinear forms, many techniques carry over to give efficient 
algorithms. 

The local version of the form is defined by 

c e (u;v,w):= / u ■ Vv(i) ■ w(x) dx 

f d d 

= u j( x )^ — Vk(x)w k (x)dx 
■!>■ /T: dx * 

f d Q 
= I Uj(J£ + x e )- — Vfc(J£ + x e )w k (J£ + x e ) det(J) d£ 

JT 3,h=l dXj 



J < j.k.rn=\ \\eC / 




dxj \^ )d WJ^)\ ,4,, 



, P ec J 

^ dx~ Vk k W dct(J)x 



8 



R. C. KIRBY AND M. KNEPLEY AND A. LOGG AND L. R. SCOTT 



Thus we have found that 

d 

i-(e.X) t(e,fi) t(e,p) 



( U ,v,w) = e E u ) 

j,k—l \ : fi : pGC 

E |^ det ( J ) f MQ^MOMSK ( 4 - 2 ) 



m— 1 



" » at 

— ,/(e,A) t(e,p) t(e,p) t/?ro j . / j\ at 



where 



j,k=l X,fj,,p£C m=l J 

Nx„--= I MoirMOMO^ (4.3) 

To summarize, we have 

j,k — l \,fi,p£C 



(4.4) 



=E E ^-f' p) EE-f' A) ^ 

k=in,pec j=i \e£ 



where the element coefficients iV| • are defined by 

^, PJ ~ E ^ det(J)iV A w . =: E G mJ N XtlltPtm (4.5) 

m— 1 ^ m— 1 

where G m j := det(J). Recall that J is the Jacobian above, and J -1 is its inverse, 
and 

fj-i\ _ ^Cm 
I )m,j - g x . ■ 

Note that both A^ A;jU;j0; (.) and iV^ ^ p ^ can be thought of as d-vectors. Moreover 

N c M) ^det(J)N XM J-\ 

Also note that N\^ Pt (.) — N Pilii x^, so that considerable storage reduction could be 
made if desired. 

The matrix C defined by C\j = c(u, <f>j) can be computed using the assembly 
algorithm as follows. First, note that C can be written as a matrix of dimension 
\C\ x \C\ with entries that are d x d diagonal blocks. In particular, let Id denote the 
d x d identity matrix. Now set C to zero, loop over all elements and up-date the 
matrix by 

i=i xec 



=h E G ^ ( E u ) 

m,j = l \\ec 
=Id E 7mA^A,p,p,m 



i(e,A) ^ (4.6) 
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for all p and p, where 



7mA 



.7=1 



i(e,A) 



(4.7) 



It thus appears that the computation of C can be viewed as similar in form to (3.1 ) 



and similar optimization techniques applied. In fact, we can introduce the notation 
K e,u where 



K e,u 



m . A £ £ 



(4.8) 



Then the update of C is done in the obvious way with K e 



4.1. Trilinear Forms with Piecewise Linears. For simplicity, we may con- 
sider the piecewise linear case. Here, the standard mixed formulations are not inf-sup 
stable, but the trilinear form is still an essential part of the well-known family of 
stabilized methods 111) that do admit equal order piecewise linear discretizations. 
Moreover, our techniques could as well be used with the nonconforming linear ele- 
ment 6 , which does admit an inf-sup condition when paired with piecewise constant 
pressures. 



In the piecewise linear case, (4.3| simplifies to 



N; 



m Jf 



(4.9) 



We can think of N- 



x,fi, P ,m defined from two matrices: N\ tlitP>m = D^ m F x , p where 



D 



fi,m 



'i 



= 1 | (d = 2) and 



(l 0\ 

1 

1 

\i i ij 



(d = 3) 



(4.10) 



and 



(4.11) 



The latter matrix is easy to determine. In the piecewise linear case, we can compute 
integrals of products using the quadrature rule that is based on edge mid-points (with 
equal weights given by the area of the simplex divided by the number of edges). Thus 
the weights are to = 1/6 for d = 2 and u> — 1/24 for d = 3. Each of the values 4>\{£) 
is either | or zero, and the products are equal to -r or zero. For the diagonal terms 
X = p, the product is non-zero on d edges, so F\,a = 1/12 for d = 2 and 1/32 for 
d = 3. If A 7^ p, then the product ^a(C) < / > p(C) is non-zero for exactly one edge (the one 
connecting the corresponding vertices), so F\. p = 1/24 for d — 2 and 1/96 for d = 3. 
Thus we can describe the matrices F in general as having d on the diagonals, 1 on 
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Table 4.1 

The tensor N (multiplied by ninety-six) for piecewise linears in three dimensions represented 
as a matrix of four by three matrices. 
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the off-diagonals, and scaled by 1/24 for d = 2 and 1/96 for d = 3. Thus 



F = 



1 

4(d + l) 



(d 1 
1 d 

V 1 



1\ 
1 

d) 



d-1 
4(d+l)! 



'd+l 



4(d + l)! 



( l 1 
1 1 

V 1 



V 



(4.12) 



for d — 2 or 3, where Id denotes the d x d identity matrix. Note that for a given d, 
the matrices in (4.12) are d+lxd+lin dimension. 



The tensor N for linears in three dimensions is presented in Table 4.1 We see 
now a new ingredient for computing the entries of K e,u from the matrix "f m .\. Define 

7m = Et=i7m.A for m = 1,2,3, and then j miX = ^l m .\ + 7m for m = 1,2,3 and 
A = 1,2,3,4. Then 



R e,u = 



(ill 

712 
713 
\7l4 



721 
722 
723 
724 



731 
732 
733 
734 



711 
712 
713 
714 



721 + 731 \ 

722 + 732 

723 + 733 

724 + 734 J 



(4.13) 



However, note that the 7 m 's are not computations that would have appeared directly 
in the formulation of K e ' u but are intermediary terms that we have defined for con- 
venience and efficiency. This requires 39 operations, instead of 384 operations using 



(4.f 



4.2. Algorithmic implications. The example in Section [4] provides guidance 
for the general case. First of all, we see that the "vector" space of the evaluation 



problem (4.8 ) can be arbitrary in size. In the case of the trilinear form in Navier-Stokes 



considered there, the dimension is the spatial dimension times the dimension of the 
approximation (finite element) space. High-order finite element approximations |14j 
could lead to very high-dimensional problems. Thus we need to think about looking 
for relationship among the "computational vectors" in high-dimensional spaces, e.g., 
up to several hundred in extreme cases. The example in Section 4.1 is the lowest 
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Table 4.2 

Summary of results for FErari on triangular Lagrange elements 



Order 


Entries 


Base MAPs 


FErari MAPs 


1 


6 


24 


7 


2 


21 


84 


15 


3 


55 


220 


45 


4 


120 


480 


176 


5 


231 


924 


443 


6 


406 


1624 


867 



order case in three space dimensions, and it requires a twelve-dimensional space for 
the complexity analysis. 

Secondly, it will not be sufficient just to look for simple combinations to deter- 
mine optimal algorithms, as discussed in Section |3.1| The example in Section |4.1| 
shows that we need to think of this as an approximation problem. We need to look 
for vectors (matrices) which closely approximate a set of vectors that we need to com- 
pute. The vectors Vi = (1, 1, 1, 1, 0, . . . , 0), V 2 = (0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0), V 3 = 
(0, . . . , 0, 1, 1, 1, 1) are each edit-distance one from four vectors we need to compute. 
The quantities j m represent the computations (dot-product) with V m . The quanti- 
ties "/ m \ are simple perturbations of 7 which require only two operations to evaluate. 
A simple rescaling can reduce this to one operation. 

Edit-distance is a useful measure to approximate the computational complexity 
distance, since it provides an upper-bound on the number of computations it takes to 
get from one vector to another. Thus we need to add this type of optimization to the 
techniques listed in Section |3~T| 

4.3. The FErari system. The algorithms discussed in Sections |3.1| and |4.2| 

have been implemented in a prototype system which we call FErari, for Finite Ele- 
ment Re-arrangement Algorithm to Reduce Instructions. We have used it to build 
optimized code for the local stiffness matrices for the Laplace operator using Lagrange 
polynomials of up to degree six. While there is no perfect metric to predict efficiency 
of implementations due to differences in architecture, we have taken a simple model to 
measure improvement due to FErari. A particular algorithm could be tuned by hand, 
common divisors in rational arithmetic could be sought, and more care could be taken 
to order the operations to maximize register usage. Before discussing the implemen- 
tation of FErari more precisely, we point to Table [4~2| to see the levels of optimization 
detected. In the table, "Entries" refers to the number of entries in the upper triangle 
of the (symmetric) matrix. "Base MAPs" refers to the number of multiply add pairs 
if we were not to detect dependencies at all, and "FErari MAPs" is the number of 
multiply-add pairs in the generated algorithm. Although we only gain about a factor 
of two for the higher order cases, we are automatically generating algorithms with 
fewer multiply-add pairs than entries for linears, quadratics, and cubics. 

FErari builds a graph of dependencies among the tensors K^ iAl in several stages. 
We now describe each of these stages, what reductions they produce in assembling 
K e , and how much they cost to perform. We start by building the tensors {Kj i(1 } 
for 1 < A < dimPfe and A < [i < dimPft using FIAT 15j. In discussing algorithmic 
complexity of our heuristics below, we shall let n be the size of this set. Throughout, 
we are typically using "greedy" algorithms that quit when they find a dependency. 
Also, the order in which these stages are performed matters, as if a dependency for a 
tensor is found at one stage, will not mark it again later. 
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Table 4.3 

Results of FErari on Lagrange elements of 



1 through 6 



Order 


Entries 


Zero 


Eq 


Eq t 


1 Entry 


Col 


Edl 


Ed2 


LC 


Default 


Cost 


1 


6 











3 





2 





1 





7 


2 


21 


3 


2 


3 


4 


2 


5 


1 


1 





15 


3 


55 


3 


17 





5 


8 


12 


5 


5 





45 


4 


120 





23 





7 


2 


25 


30 


25 


8 


176 


5 


231 





18 





13 


5 


41 


71 


45 


38 


443 


6 


406 





27 





17 


7 


59 


139 


61 


96 


867 



As we saw earlier, sometimes Ka. p = 0. In this case, (K e )\ yfl = as well for any 
e, so these entries cost nothing to build. Searching for uniformly zero tensors can be 
done in n operations. While we found that three tensors vanish for quadratics and 
cubics, none do for degrees four through six. 

Next, if two tensors Ka. m and Kv.^' are equal, then their dot product into G e 
will also be equal. So, we mark Kv |(J ' as depending on K>^ t . Naively, searching 
through the nonzero tensors is an 0(n 2 ) process, but can be reduced to 0(nlog(n)) 
operations by inserting the tensors into a binary tree using lexicographic ordering of 
the entries of the tensors. The fourth column of Table 4.3, titled "Eq", shows the 
number of such dependencies that FErari found. 

As a variation on this theme, if Ka, m = (Ky^'Y, then Kx, p : G e — (Kv, P ' : G e 
since G e is symmetric. Hence, once {K e )x^ is computed, {K e )y^> is free. We may 
also search for these dependencies among nonzero tensors that are not already marked 
as equal to another tensor in 0{n log(n)) time by building a binary tree. In this case, 
we compare by lexicographically ordering the components of the symmetric part of 
each tensor (recall the symmetric part of K is K \ )■ Equality of symmetric parts is 
necessary but not sufficient for two matrices to be trasposes of each other. So, if our 
insertion into the binary tree reveals an entry with the same symmetric part, we then 
perform an additional check to see if the two are indeed transposes. Unfortunately, we 
only found such dependencies for quadratics, where we found three. This is indicated 



in the fifth column of Table [473} titled "Eq t." 

So far, we have focused on finding and marking tensors that can be dotted into G e 
with no work (perhaps once some other dot product has been performed). Now, we 
turn to ways of finding tensors whose dot product with G e can be computed cheaply. 
The first such way is to find unmarked tensors K^ ^ with only one nonzero entry. 



This may be trivially performed in O(n) time. The sixth column of Table 4.3 titled 
"1 Entry," shows the number of such tensors for each polynomial degree. 



If there is a constant a such that K> 



aK.\,u- then K> 



G e = aK 



A.// 



G e 



and hence may be computed in a single multiply. Moreover, colinearity among the 
remaining tensors may be found in 0(n log (n)) time by a binary tree and lexicographic 
ordering of an appropriate normalization. To this end, we divide each (nonzero) tensor 
by its Frobenius norm and ensure that its first nonzero entry is positive. If insertion 
into the binary tree gives equality under this comparison, then we mark the tensors 
as colinear. The numbers of such dependencies found for each degree is see in the 
"Col" column Table El 

Next, we seek things that are close together in edit distance, differing only in 
one or two entries. Then, the difference between the dot products can be computed 
cheaply. For each unmarked tensor, we look for a marked tensor that is edit distance 
one away. We iterate this process until no more dependencies are found, then search 
for dependencies among the remainders. We repeat this process for things that are 
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Fig. 4.1. Generated code for computing the stiffness matrix for linear basis functions 
from Numeric import zeros 
G=zeros(4,"d") 
def K(K, jinv) : 

detinv = 1 . 0/ (jinv [0 , 0] * jinv [1 , 1] - j inv [0 , 1] *j inv [1 ,0] ) 

G[0] = ( jinv[0,0]**2 + jinv[l,0]**2 ) * detinv 

G[l] = ( j inv [0, 0] * j inv [0,1] +j inv [1,0]* jinv [1,1] ) * detinv 

G[2] = G[l] 

G[3] = ( jinv[0,l]**2 + jinv[l,l]**2 ) * detinv 

K[l,l] = 0.5 * G[0] 

K[1,0] = -0.5 * G[l]- K[l,l] 

K[2,l] = 0.5 * G[2] 

K[2,0] = -0.5 * G[3]- K[2,l] 

K[0,0] = -1.0 * K[1,0] + -1.0 * K[2,0] 

K[2,2] = 0.5 * G[3] 

K[0,1] = K[1,0] 

K[0,2] = K[2,0] 

K[l,2] = K[2,l] 

return K 



edit distance two apart. In general, this process seems to be 0(n ). Table 4.3 has 
two columns, titled "Edl" and "Ed2," showing the number of tensors we were able 
to mark in such a way. 

If a tensor is a linear combination of two other tensors, then its dot product can be 
computed in two multiply-add pairs. This is an expensive search to perform (0(n 3 )), 
since we have to search through pairs of tensors for each unmarked tensor. However, 
we do seem to find quite a few linear combinations, as seen in the "LC" column of 
|4.3| Any remaining tensors are marked as "Default" in which case they are computed 
by the standard four multiply-add pairs. 

After building the graph of dependencies, we topologically sort the vertices. This 
gives an ordering for which if vertex u depends on vertex v, then v occurs before 
u, an essential feature to generate code. We currently generate Python code for 
ease in debugging and integrating with the rest of our computational system, but we 
could just as easily generate C or Fortran. In fact, future work holds generating not 
particular code, but abstract syntax as in PETSc 3.0 [TH] as to enable code generation 
into multiple languages from the same graph. One interesting feature of the generated 
code is that it is completely unrolled - no loops are done. This leads to relatively 
large functions, but sets up the code to a point where the compiler really only needs to 
handle register allocation. In Figure |4~T) we present the generated code for computing 
linears. 

4.4. Code verification. In general, the question of verifying a code's correct- 
ness is difficult. In this case, we have taken an existing Poisson solver and replaced 
the function to evaluate the local stiffness matrix with FErari-generated code. The 
correct convergence rates are still observed, and the stiffness matrices and computed 
solutions agree to machine precision. However, in the future, we hope to generate 
optimized code for new, complicated forms where we do not have an existing verified 
implementation, and the general question of verifying such codes is beyond the scope 
of this present work. 
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5. Computing a matrix via quadrature. The computations in equations 



(2.6-2.7) can be computed via quadrature as 



d 

£(EH m,n— 1 

d (5.1) 

m,n— 1 
,/ 



where the coefficients Kx >t i, m ,n are analogous to those defined in (2.3), but here they 
are defined by quadrature: 

Kx,^,m,n = y^^g0A,m(O^M,"(O (5.2) 
?6S 



(The coefficients are exactly those of (2.3) if the quadrature is exact.) 



The right strategy for computing a matrix via quadrature would thus appear to be 
to compute the coefficients Kx,u, m ,n first using (5.2), and then proceeding as before. 



However, there is a different strategy associated with quadrature when we want only 
to compute the action of the linear operator associated with the matrix and not the 
matrix itself (cf. [16]). 

6. Other approaches. We have presented one approach to optimize the com- 
putation of finite element matrices. Other approaches have been suggested for op- 
timizing code for scientific computation. The problem that we are solving can be 
represented as a sparse-matrix-times-full-matrix multiply, ICQ, where the dimensions 
of K. are, for example, \C\ 2 x d for Laplace's equation in d-dimensions using a (local) 
finite-clement space C. For the non-linear term in Navier-Stokes, the dimensions of 
IC become \C\ 2 x d\C\. The first dimension of Q of course matches the second of IC, 
but the second dimension of Q is equal to the number of elements in the mesh. Thus 
it is worthwhile to do significant precomputation on IC. 

At a low level, ATLAS [26] works with operations such as loop unrolling, cache 
blocking, etc. This type of optimization would not find the reductions in computation 
that FErari does, since the latter identifies more complex algebraic structures. In this 
sense, our work is more closely aligned with that of FLAME [7] which works with 
operations such as rank updates, triangular solves, etc. The BeBOP group has also 
worked on related issues in sparse matrix computation [22j [24J, [23J [25j [12] . Our work 
could be described as utilizing a sparsity structure in a matrix representation, and 
it can be written as multiplying a sparse matrix times a very large set of (relatively 
small) vectors (the columns of Q). This is the reverse of the case considered in the 
SPARSITY system |13j . where multiplying a sparse matrix times a small number of 
large vectors is considered. 

Clearly the ideas we present here could be coupled with these approaches to 
generate improved code. The novelty of our work resides in the precomputation that 
is applied to evaluate the action of IC. In this way, it resembles the reorganization of 
computation done for evaluation of polynomials or matrix multiply [IT] . 
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Quadrature 


Tensor 


FFC 


FErari 


Assemble 


Matvec 


Linear 
Quadratic 


0.3802 
2.0000 


0.0725 
0.3367 


0.0535 
0.1517 


0.0513 
0.1506 


0.4762 
1.9342 


0.0177 
0.1035 



Table 7.1 

Seconds to process one million triangles: local stiffness matrices, global matrix insertion, and 
matrix-vector product 



7. Timing results. We performed several experiments for the Poisson problem 
with piecewise linear and piecewise quadratic elements. We set up a series of meshes 
using the mesh library in DOLFIN 10J. These meshes contained between 2048 and 
524288 triangles. We timed computation of local stiffness matrices by several tech- 
niques, their insertion into a sparse PETSc matrix [3], multiplying the matrix onto 
a vector, and solving the linear system. All times were observed to be linear in the 
number of triangles, so we report all data as time per million triangles. All computa- 
tions were performed on a Linux workstation with a 3 GHz Pentium 4 processor with 
2GB of RAM. All our code was compiled using gec with "-03" optimization. 

Our goal is to assess the efficacy and relevance of our proposed optimizations. 
Solver technology has been steadily improving over the last several decades, with 
multigrid and other optimal strategies being found for wider classes of problems. 
When such efficient solvers are available, the cost of assembling the matrices (both 
computing local stiffness matrices and inserting them into the global matrix) is much 
more important. This is especially true in geometric multigrid methods in which 
a stiffness matrix can be built on each of a sequence of nested meshes, but only a 
few iterations are required for convergence. To factor out the choice of solver, we 
concentrate first on the relative costs of building local stiffness matrices, inserting 
them into the global matrix, and applying the matrix to a vector. 

We used four strategies for computing the local stiffness matrices - numerical 
quadrature, tensor contractions (four flops per entry), tensor contractions with zeros 
omitted (this code was generated by the FEniCS Form Compiler [H]), and the 
FErari-optimized code translated into C. For both linear and quadratic elements, 
the cost of building all of the local stiffness matrices and inserting them into the 
global sparse matrix was comparable (after storage has been preallocated). In both 
situations, computing the matrix-vector product is an order of magnitude faster than 
computing the local matrices and inserting them into the global matrix. These costs, 
all measured seconds to process one million triangles, are given in Tableland plotted 
Figure [7] using log-scale for time. 

We may draw several conclusions from this. First, precomputing the reference 
tensors leads to a large performance gain over numerical quadrature. Beyond this, 
additional benefits are included by omitting the zeros, and more still by doing the 
FErari optimizations. We seem to have optimized the local computation to the point 
where it is constrained by read-write to cache rather than by floating point optimiza- 
tion. Second, these optimizations reveal a new bottleneck: matrix insertion. This 
motivates studying whether we may improve the performance of insertion into the 
global sparse matrix or else implementing matrix-free methods. 

While the FErari optimizations are highly successful for building the matrices, 
there is still quite a bit of solver overhead to contend with. We used GMRES (bound- 
ary conditions were imposed in a way that broke symmetry) preconditioned by the 
BoomerAMG method of HYPRE [3]. For both linear and quadratics, the Krylov 
solver converged with only three iterations. However, the cost of building and apply- 
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Seconds per million triangles 



Time (s) 




Fig. 7.1. Seconds to process one million triangles: local stiffness matrices, global matrix inser- 
tion, and matrix-vector product 



ing the AMG preconditioner is very large. Using numerical quadrature, building local 
stiffness matrices accounted for between five and nine percent of the total run time 
(building the local to global mapping, computing geometry tensors, computing local 
stiffness matrices, sorting and inserting local matrices into the global matrix, creating 
and applying the AMG preconditioner, and the rest of the Krylov solver). Keeping 
everything else constant and switching to the FErari optimized code, building local 
stiffness matrices took less than one percent of total time. Geometric multigrid al- 
gorithms tend to be much more efficient, and we conjecture that the cost of building 
local matrices would be even more important in this case. 

8. Conclusions. The determination of local element matrices involves a novel 
problem in computational complexity. There is a mapping from (small) geometry 
matrices to "difference stencils" that must be computed. We have demonstrated the 
potential speed-up available with simple low-order methods. We have suggested by 
examples that it may be possible to automate this to some degree by solving abstract 
graph optimization problems. 

9. Acknowledgements. We thank the FEniCS team, and Todd Dupont, Johan 
Hoffman, and Claes Johnson in particular, for substantial suggestions regarding this 
paper. 
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